Summary
Choosing a Project
Topic? Sports, Beer, Other
Supervised or unsupervised learning?
Data Source: Download, Web Scrape, Social Media
Tools: Python, R, Weka, Tableau, Excel
Choice
Data from Kaggle
Audio Analysis
Supervised Learning
Classification
Machine Learning in Python
Presentation & Report in R Markdown
Excel for results transfer
Goals
Classify the gender of the speaker in each audio clip
Learn which audio features best separate the genders
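One simple way to approach the second goal is to score each feature by how far apart the two class means sit relative to the spread within each class. The sketch below is only an illustration, not the method used in this project; the feature names and values are hypothetical toy data.

```python
from statistics import mean, pstdev


def separation_score(male_values, female_values):
    """Absolute difference of class means divided by the pooled standard deviation."""
    pooled = ((pstdev(male_values) ** 2 + pstdev(female_values) ** 2) / 2) ** 0.5
    gap = abs(mean(male_values) - mean(female_values))
    if pooled == 0:
        # No spread in either class: perfectly separated, or identical.
        return float("inf") if gap > 0 else 0.0
    return gap / pooled


# Hypothetical toy data: feature name -> (male values, female values)
features = {
    "mean_pitch_khz": ([0.11, 0.12, 0.10, 0.13], [0.20, 0.21, 0.19, 0.22]),
    "clip_energy": ([0.50, 0.70, 0.40, 0.60], [0.55, 0.65, 0.45, 0.60]),
}

# Rank features from most to least separating.
ranking = sorted(
    features,
    key=lambda name: separation_score(*features[name]),
    reverse=True,
)
print(ranking)
```

A higher score means the feature separates the two groups more cleanly on its own; it says nothing about how features interact inside a model.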
Conclusions
Criteria (top three models, best first)
Accuracy
- KNN (PCA)
- Random Forest
- Log Regression
Male Accuracy
- Log Regression
- Log Regression (Normal)
- KNN (PCA)
Female Accuracy
- Random Forest
- KNN (PCA)
- Log Regression (Normal)
AUC
- Random Forest
- Log Regression
- Log Regression (Normal)
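The three accuracy criteria can be computed from the true and predicted labels. The sketch below assumes that "Male Accuracy" and "Female Accuracy" mean the share of clips of that gender that were classified correctly; the slides do not define them, and the labels are made-up toy data.

```python
def accuracy_report(y_true, y_pred, labels=("male", "female")):
    """Return overall accuracy plus the per-class accuracy for each label."""
    report = {}
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    report["overall"] = correct / len(y_true)
    for label in labels:
        # Positions of the clips whose true label is this class.
        idx = [i for i, t in enumerate(y_true) if t == label]
        report[label] = sum(y_pred[i] == label for i in idx) / len(idx)
    return report


# Toy labels: one male clip is misclassified as female.
y_true = ["male", "male", "male", "male", "female", "female", "female", "female"]
y_pred = ["male", "male", "male", "female", "female", "female", "female", "female"]

report = accuracy_report(y_true, y_pred)
print(report)
```

Reporting the per-class values alongside the overall figure shows whether a model's errors fall mostly on one gender.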
ROC
Area Under the Curve
- KNN: 0.8899249
- Decision Tree: 0.9606488
- SVM: 0.9611217
- Log Reg: 0.9961107
- KNN (PCA): 0.9921023
- Random Forest: 0.9979454
- SVM (PCA): 0.9930792
- Log Reg (Normal): 0.9955755
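For reference, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it directly from that definition on toy scores; it is not the code that produced the values above, and which gender counts as the positive class is an arbitrary assumption here.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: 1 = positive class, 0 = negative class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(auc)
```

The pairwise loop is quadratic in the number of clips, which is fine for illustration; library implementations use a sort-based approach instead.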
Conclusion
Best Model
Best Model: Random Forest
Second-highest overall accuracy
Highest female accuracy
Highest Area Under the Curve
Decent Fitting Time
Faster Scoring Time
Improvements
Focus on a single method
Combine features to create new ones
Implement more advanced methods (Bagging/Boosting)
Extract features from raw audio files
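To make the bagging idea concrete, the sketch below trains many one-feature threshold classifiers ("stumps") on bootstrap resamples and combines them by majority vote. It is a toy illustration under stated assumptions, not a proposed implementation: the single feature and its values are hypothetical, and each stump only predicts class 1 above its threshold.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Choose the threshold with the fewest training errors (predict 1 if x > threshold)."""
    best_threshold, best_errors = None, None
    for threshold in sorted(set(xs)):
        errors = sum((x > threshold) != bool(y) for x, y in zip(xs, ys))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def fit_bagged_stumps(xs, ys, n_models=25, seed=0):
    """Fit one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]
        thresholds.append(
            fit_stump([xs[i] for i in sample], [ys[i] for i in sample])
        )
    return thresholds


def predict(thresholds, x):
    """Majority vote over all stumps."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical pitch-like feature: class 0 has low values, class 1 has high values.
xs = [0.10, 0.11, 0.12, 0.13, 0.19, 0.20, 0.21, 0.22]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

model = fit_bagged_stumps(xs, ys)
print(predict(model, 0.05), predict(model, 0.30))
```

Using an odd number of models avoids tied votes in the two-class case.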